AVG() Function
The dataset: student GPA data for different terms.
John Smith, fall 2000, 3.9
John Smith, winter 2000, 3.8
John Smith, spring 2001, 4.0
John Smith, summer 2001, 3.9
Mary Clark, fall 2000, 3.5
Mary Clark, winter 2000, 3.0
Mary Clark, spring 2001, 3.8
Mary Clark, summer 2001, 3.9
Calculate the average GPA of each student.
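The task above amounts to grouping the rows by student and averaging each group, which in Pig is a GROUP BY followed by AVG(). The following is a minimal plain-Python sketch of that arithmetic, not Pig Latin; the names rows and average_by_student are hypothetical and only illustrate the computation.

```python
from collections import defaultdict

# The dataset from above as (name, term, gpa) tuples.
rows = [
    ("John Smith", "fall 2000", 3.9),
    ("John Smith", "winter 2000", 3.8),
    ("John Smith", "spring 2001", 4.0),
    ("John Smith", "summer 2001", 3.9),
    ("Mary Clark", "fall 2000", 3.5),
    ("Mary Clark", "winter 2000", 3.0),
    ("Mary Clark", "spring 2001", 3.8),
    ("Mary Clark", "summer 2001", 3.9),
]


def average_by_student(records):
    """Group GPA values by student name, then average each group."""
    grouped = defaultdict(list)
    for name, _term, gpa in records:
        grouped[name].append(gpa)
    return {name: sum(gpas) / len(gpas) for name, gpas in grouped.items()}


averages = average_by_student(rows)
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

By hand, John Smith averages (3.9 + 3.8 + 4.0 + 3.9) / 4 = 3.9 and Mary Clark averages (3.5 + 3.0 + 3.8 + 3.9) / 4 = 3.55, up to floating-point rounding.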

ROUND_TO() Function
Returns the value of an expression rounded to a fixed number of decimal digits, unlike the ROUND() function, which returns the value of an expression rounded to an integer.
The syntax:
ROUND_TO(val, digits [, mode])
val: an expression whose result is type float or double: the value to round.
digits: an expression whose result is type int: the number of digits to preserve.
mode: an optional int specifying the rounding mode, according to the constants Java provides.
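To make the val and digits parameters concrete, here is a small plain-Python sketch of rounding to a fixed number of digits. It is not Pig's implementation: the function name round_to is hypothetical, and half-even rounding is chosen here purely as an illustrative assumption, so check the Pig documentation for the mode Pig actually applies by default.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to(val, digits, mode=ROUND_HALF_EVEN):
    """Round val so that `digits` decimal digits are preserved.

    A negative `digits` zeros out places to the left of the decimal point.
    The default rounding mode is an assumption made for this sketch.
    """
    # digits=2 gives a quantum of 0.01; digits=-1 gives a quantum of 10.
    quantum = Decimal(1).scaleb(-digits)
    # Go through str() so the decimal value matches what the reader typed.
    return float(Decimal(str(val)).quantize(quantum, rounding=mode))


print(round_to(3.14159, 2))
print(round_to(1234, -1))
```

Passing a different rounding constant from the decimal module as mode plays the role of the optional third argument.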

MAX() and MIN() Functions
MAX() or MIN() requires a preceding GROUP ALL statement for global maximums or minimums and a GROUP BY statement for group maximums or minimums.

SUBTRACT() Function
SUBTRACT() takes two bags as arguments and returns a new bag composed of the tuples of the first bag that are not in the second bag.
SUBTRACT() assumes that both input bags fit entirely into memory simultaneously; if this is not the case, it will still function but will be very slow.
Find the bag elements that are in the first bag but not in the second bag:
({(8,9),(0,1),(1,2)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(3,7),(3,7)},{(2,2),(3,7)})
({(1,2),(3,4),(5,6),(7,8)},{(2,3),(1,2)})
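The bag difference described above can be checked by hand with a short plain-Python sketch. It models each bag as a list of tuples; the names subtract, pairs, and results are hypothetical, and this is only an illustration of the semantics, not Pig's code.

```python
def subtract(first, second):
    """Return the tuples of `first` that do not occur in `second`."""
    exclude = set(second)
    return [t for t in first if t not in exclude]


# The four input rows from above, each a pair of bags.
pairs = [
    ([(8, 9), (0, 1), (1, 2)], [(8, 9), (1, 1)]),
    ([(2, 3), (4, 5)], [(2, 3), (4, 5)]),
    ([(3, 7), (3, 7)], [(2, 2), (3, 7)]),
    ([(1, 2), (3, 4), (5, 6), (7, 8)], [(2, 3), (1, 2)]),
]

results = [subtract(first, second) for first, second in pairs]
for bag in results:
    print(bag)
```

The second and third rows come out empty because every tuple of the first bag also appears in the second bag.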
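
ENDSWITH() and STARTSWITH() Functions
ENDSWITH() tests whether a string ends with another string, and STARTSWITH() tests whether it starts with another string; each returns true or false.
Syntax: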
ENDSWITH(string, testAgainst)
STARTSWITH(string, testAgainst)
Examples:
ENDSWITH ('foobar', 'foo') --> false
ENDSWITH ('foobar', 'bar') --> true
STARTSWITH ('foobar', 'foo') --> true
STARTSWITH ('foobar', 'bar') --> false

LTRIM(), RTRIM() and TRIM() Functions
LTRIM(): Returns a copy of a string with only leading white space removed.
RTRIM(): Returns a copy of a string with only trailing white space removed.
TRIM(): Returns a copy of a string with leading and trailing white space removed.

SUBSTRING() Function
Syntax:
SUBSTRING(string, startIndex, stopIndex)
string: the string from which a substring will be extracted.
startIndex: the index (type int) of the first character of the substring.
stopIndex: the index (type int) of the character following the last character of the substring.
Example:
SUBSTRING('Cornell', 0, 4) --> 'Corn'

TOTUPLE() Function
Syntax:
TOTUPLE(expression [, expression ...])
Example:
a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOTUPLE(name, age, gpa);
DUMP b;
((John,18,4.0))
((Mary,19,3.8))
((Bill,20,3.9))
((Joe,18,3.8))

TOBAG() Function
Syntax:
TOBAG(expression [, expression ...])
Example:
a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOBAG(name, gpa);
DUMP b;
({(John),(4.0)})
({(Mary),(3.8)})
({(Bill),(3.9)})
({(Joe),(3.8)})

TOMAP() Function
Syntax:
TOMAP(key-expression, value-expression[, key-expression, value-expression ...])
Example:
a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOMAP(name, gpa);
DUMP b;
([John#4.0])
([Mary#3.8])
([Bill#3.9])
([Joe#3.8])

Python UDFs are made available to a Pig script with the register command.
Files can be copied to HDFS (with hadoop fs -copyFromLocal ..., or Ambari's Files View).

poem.txt File with a Python UDF in Pig
There is Another Sky
Emily Dickinson
There is another sky,
Ever serene and fair,
And there is another sunshine,
Though it be darkness there;
Never mind faded forests, Austin,
Never mind silent fields -
Here is a little forest,
Whose leaf is ever green;
Here is a brighter garden,
Where not a frost has been;
In its unfading flowers
I hear the bright bee hum:
Prithee, my brother,
Into my garden come!

Edit pyudf0.py in the vi editor.
To expose the function as a UDF, add the @outputSchema decorator, which specifies a schema for the return value.
@outputSchema("length:int")
def get_length(data):
    length = len(data)
    return length
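
Outside Pig, the name outputSchema is not defined, so running the file directly raises a NameError. For a quick local check before registering the script, one option is a no-op stand-in for the decorator. The sketch below assumes that approach; the stand-in outputSchema and the sample_lines list are hypothetical scaffolding and not part of Pig.

```python
def outputSchema(schema):
    """Local stand-in for Pig's decorator: keep the schema, return the function unchanged."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("length:int")
def get_length(data):
    length = len(data)
    return length


# Two lines of the poem, used as sample input.
sample_lines = ["There is another sky,", "Here is a little forest,"]
lengths = [get_length(line) for line in sample_lines]
print(lengths)
```

Remove the stand-in before handing the file to Pig so that it does not shadow the decorator Pig supplies.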

pyudf1.py
Add "# of characters = " and "# of words = " before the respective values.
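@outputSchema("nums:chararray")
def get_length(data):
    # Split on whitespace to count words; len(data) counts characters.
    words = data.split()
    num_chars = len(data)
    num_words = len(words)
    # Build one string so the result matches the declared chararray schema.
    nums = "# of characters = " + str(num_chars) + ", # of words = " + str(num_words)
    return nums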